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A Random Force is a Force, of Course, of Coarse: 
Decomposing Complex Enzyme Kinetics with Surrogate Models 
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The temporal autocorrelation (AC) function associated with monitoring order parameters charac- 
terizing conformational fluctuations of an enzyme is analyzed using a collection of surrogate models. 
The surrogates considered are phenomenological stochastic differential equation (SDE) models. It is 
demonstrated how an ensemble of such surrogate models, each surrogate being calibrated from a sin- 
gle trajectory, indirectly contains information about unresolved conformational degrees of freedom. 
This ensemble can be used to construct complex temporal AGs associated with a "non-Markovian" 
process. The ensemble of surrogates approach allows researchers to consider models more flexible 
than a mixture of exponentials to describe relaxation times and at the same time gain physical 
information about the system. The relevance of this type of analysis to matching single-molecule 
experiments to computer simulations and how more complex stochastic processes can emerge from 
a mixture of simpler processes is also discussed. The ideas are illustrated on a toy SDE model and 
on molecular dynamics simulations of the enzyme dihydrofolate reductase. 

PACS numbers: 82.39.Fk, 87.15. Vv,02.50.-r 87.10.Mn 



I. INTRODUCTION 

When enzymes and other proteins are probed at the 
single-molecule level, it has been observed in both exper- 
iments and simulation studies [1, i, 0, B i, 
that conformational fluctuations at several disparate 
timescales have physically significant influence on both 
large scale structure and biochemical function. In this 
article, a method for using a collection of Marovian sur- 
rogate models [ll|, [H, to predict kinetics that would 
often be considered non-Markovian is presented [13, [l^ . 
The ideas in [l^ are extended to treat a system with 
more complex kinetics. The aim of the approach is to 
obtain a better quantitative understanding the factors 
contributing to complex time autocorrelations (AGs) as- 
sociated with quantities modulated by slowly evolving 
conformational degrees of freedom. The focus is on sys- 
tems where certain thermodynamically important con- 
formational degrees of freedom evolve over an effective 
free energy surface with relatively low-barriers; this sit- 
uation is often relevant to molecules undergoing a "pop- 
ulation shift" or "selected-fit" mechanism 0, and the 
connection to "dynamic disorder" [l[ is also discussed. 
The particular enzyme studied is dihydrofolate reductase 
(DHFR) because of its biological relevance to therapeu- 
tics and also due to the rich kinetics observed in certain 
order parameters [1]. 

Surrogate models are used to describe the short time 
dynamics. These surrogates are fairly simple phenomeno- 
logical parametric stochastic differential equation (SDE) 
models. Specifically the Ornstein-Ulhlcnbcck process and 
an overdamped Langevin e quat ion with a position depen- 
dent diffusion function pH Il3l. ITgI . ITtI are considered as 



the candidate surrogate models. Position dependent dif- 
fusion is often observed when a few observables (or order 
parameters) are used to describe an underlying complex 
system such as a protein. Position dependent noise mod- 
els allow one to consider AGs having a different functional 
form than an exponential decay and it is demonstrated 
that this added flexibility can be of assistance in both 
understanding short and long timescale kinetics. Maxi- 
mum likelihood type estimates utilizin g tr ansition den- 
sities, exact and approximate [H, ITol. [20|. are used to 
fit our surrogate SDE models. The fitting method does 
not require one to discretize [l^ state space (the sur- 
rogates assume a continuum of states). The temporal 
AC is not used directly as a fitting criterion [l^ [2l| . 
but the surrogate models are able to accurately predict 
the AG after the model parameters are fit. Maximum 
likelihood based approaches employing accurate transi- 
tion density approximations and a parametric structure 
posses several advantages in this type of application . 
An accurate transition density of a parametric SDE facili- 
tates goodness-of-fit tests appropriate for both stationary 
(23I [2JI and nonstationary time series [25l| . The latter is 
particularly relevant to many systems (like the one con- 
sidered here) where the diffusion coefficient is modulated 
by factors not directly monitored [l^, and the hypoth- 
esis of a flxed/stationary surrogate model describing the 
modeled time series is questionable. Statistically testing 
the validity of various assumptions explicitly or implicitly 
behind a candidate surrogate model, such as Markovian 
dynamics, state dependent noise and/or regime switch- 
ing is helpful in both experimental and simulation data 
settings [2l,[27|. 

The type of modeling approach presented is attractive 
from a physical standpoint for a variety of reasons. The 
items that follow are discussed further in the Results and 
Discussion: 
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• In situations where the magnitude of the local flue- 
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tuations depend significantly on the instantaneous 
value of the order parameter monitored, simple ex- 
ponential (or a finite mixture of exponentials psj ) 
can be inadequate to describe the relaxation and/or 
AC function 13]. The surrogates proposed can 
account for this situation when overdamped diffu- 
sion models can be used; additionally the estimated 
model parameters can be physically interpreted. 

• It has been observed that even in single-molecule 
trajectories that "dynamic disorder" can be ob- 
served due to ignoring certain conformational de- 
grees of freedom [l|, U ; the methods proposed 
here can be used to account for this type of vari- 
ability and show promise for comparing frequently 
sampled single-molecule experimental time series 
to computer simulations where dynamic disorder 
is relevant. 

• Changes in conformational fluctuation magnitudes 
have been suggested to lead to physically interest- 
ing phenomena, so possessing a means for quan- 
titatively describing an ensemble of dynamical re- 
sponses can help one in better understanding the 
complex dynamics of enzymes, e.g. [sl. IgI. ItI [io| . 

• There is general interest in showing how more com- 
plex stochastic processes arise from a collection of 
simpler parts [l^, [sO, [lli [12]. We discuss how, 
within a single trajectory, a continuous type of 
regime switching of Markovian surrogate models 
gives rise to an AC that would be considered non- 
Mar kovian. 

The remainder of this article is organized as follows: 
Section |TT] reviews the background and presents the mod- 
els considered. Section IIIII introduces the new model- 
ing procedure used to approximate the AC function of 
a molecule experiencing multiple types of fluctuations. 
Section [TVl presents the Results and Discussion, Section 
|V] provides the Summary and Conclusions and this is fol- 
lowed by an Appendix. 



II. BACKGROUND AND METHODS 



can often approximate the effective stochastic dynamics 
of the order parameter [ll|, [l^. In the above and 
cr^(-) are the nonlinear deterministic drift and diffusion 
functions (respectively) and Bt represents the standard 
Brownian motion [33] . The finite dimensional parameter 
vector is denoted by 9 and T is used to represent unre- 
solved lurking degrees of freedom that slowly modulate 
the dynamics [s^ ]. 

The surrogate SDE models are formed by first divid- 
ing each trajectory into L temporal partitions. Each es- 
timated parameter vector is denoted by 0^ using the se- 
quence {zi}Jlrp^ ^ and an assumed model , where £ is an 
index of a partition, 1 =: Tq < . . . < < . . . < Tl := N , 
used to divide a time series into L disjoint local tempo- 
ral windows. Within each of these windows, the data 
and the assumed model structure is used to cornpute 9i 
using maximum likelihood type methods (exact [20| and 
approximate depending on the model). The para- 
metric structures considered are presented in the next 
section. It is to be stressed that we estimate a collection 
of models, {di}f^i for each trajectory. The differences 
in the estimated parameters are due in part to random 
slowly evolving forces modulate the dynamics and also 
in part to unavoidable estimation uncertainty associated 
with a finite time series. It is demonstrated that a collec- 
tion of surrogate model parameter vectors is needed to 
summarize conformational fluctuations inherent to many 
complex biomolecules. This procedure is repeated for 
each observed MD trajectory / time series. 

The term "local diffusion coefficient" = D{z;r) := 
cr^(z; 9, r) is introduced in order to distinguish the coef- 
ficient in the Eqn. [T]from the diffusion coefficient usually 
implied in the physical sciences: we estimate the former 
from observed data. The term "diffusion coefficient" used 
in statistical physics [SS^ is not necessarily the same as 
D{z;r). If r does not modulate the dynamics, the two 
definitions are effectively identical. However one theme 
of this paper is that some traditional dynamical sum- 
maries of statistical physics, such as diffusion coefficient 
and ensemble based AC, can be modified or made less 
coarse by using a collection of surrogate models. Such a 
procedure may help in intprctting/undcrstanding single- 
molecule time series. 



A. Effective Dynamics and Statistical Inference 



B. Candidate Surrogate SDE Models 



The trajectory generated by a detailed molecular dy- 
namics (MD) simulation will be denoted by The 
dynamics of the order parameter monitored is assumed 
to be complex (nonlinear, modulated by unobserved fac- 
tors, etc.) even at the relatively short 0{ns) time in- 
tervals the order parameter time series is observed over. 
However, over short ~ 50 — lOOps time intervals a con- 
tinuous stochastic differential equation (SDE) having the 
form 



dzt = /i(zt; 9, r)dt + V2a{zt; 9, T)dBt 



(1) 



Two local parametric SDE models consid- 
ered. MATLAB scripts illustrating how to ob- 
tain parameter estimates of both models from 
discretely observed data are available online 
'www. caEun. rice . edu/techjreports/2008^bstracts .htiiil#TR08-2 
The first is a linear, constant additive noise process: 



dzt = B{A - zt)dt + V2CdBt 



(2) 



The above SDE has a rich history in both the physical sci- 
ences [3^ where it is usually referred to as the Ornstein- 
Uhlenbeck (OU) process and in econometrics where it is 
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sometimes referred to as the Vasicek process [20'] . The pa- 
rameter vector to estimate in this model is 9 = {A, B,C). 
This model is appealing for a variety of reasons, one being 
that the exact transition density and maximum likelihood 
parameter vector for a discretely sampled process [56] can 
be written in closed-form, i.e. a numerical optimization 
is not needed to find the parameter vector because the 
parameter estimate can be written explicitly in terms of 
9 and the observed data [lO] ■ 

The second is a nonlinear, position dependent over- 
damped (PDOD) Langevin type SDE [ll|, 111, i37,] : 

dzt = /3(C + D{zt - ^Po)y [a + B{zt - ^o))dt 

+V2{c + Dizt-^jjo))dBt. (3) 

The variable (3 = \/{kBT) is the inverse of the product of 
Boltzmann's constant and the system temperature. ■00 
represents a free parameter; in this article it coincides 
with the umbrella sampling point specified in the simu- 
lation. The parameter vector to estimate in this model 
is 9 = {A, B,C, D). Each parameter is estimated using 
the observed data and the transition density expansions 
associated with Eqn. [3] is used to construct a log 
likelihood cost function. A Nelder-Mead search is then 
used to find the 9 maximizing the associated cost func- 
tion. The effective force in the above model is assumed 
to be linear in z, e.g. F(z) := ^ -I- B{z — tpo) whereas 

the diffusion function = D(z] T) := (^C + D{z — ■0o)^ is 

quadratic in z. The overdamped appellation comes from 
multiplying the effective force by the effective friction (as 
determined by the Einstein relation [37j) corresponding 
to this diffusion function. 

In this article, all stochastic integrals used are Ito 
integrals. When a complex high-dimensional system 
with multiple timescales is approximated with a low di- 
mensional SDE possessing position dependent noise the 
choice of the Ito or Stratonovich integral influences the 
interpretation of the drift function and the issue of which 
interpretation is "physically correct" is a nontrivial prob- 
lem 38, [3^. A related item is the so-called "noise- 
induced drift" [13, Elj. Such a term is sometimes ex- 
plicitly added to the drift [4l[, one thermodynamic mo- 
tivation for this is discussed further in Section fVlI Bl 

An appealing feature of the data-driven modeling pro- 
cedure presented here and elsewhere [ll|, [l^ [2^ [SJ, H^] 
is that various SDE models, of an explicitly specified 
form, can be considered, estimated, and tested using ob- 
served trajectories. Statistical hypothesis tests making 
use of the conditional distribution (not just moments) 
of the assumed surrogate model can then be used to 
test if the model assumptions are justified for the ob- 
served data. Tools from mathematical statistics [2^, [2^ 
facilitate quantitatively and rigorously testing if certain 
features are required to adequately describe the stochas- 
tic dynamics. Many features, e.g. position-dependent 
noise, would be hard to statistically check using AC based 



heuristic methods. Such heuristic checks are traditionally 
used in statistical physics, e.g. [39l. li^. 

The data-driven models are used to approximate the 
stochastic evolution of black-box data and the estimated 
parameters do have a loose physical phenomenological 
interpretation. If one desires to compute unambiguous 
physical quantities from the estimated coefficients using 
a particular definition from statistical physics, the mod- 
els can also be used to generate data for this purpose. 
For example, surrogate models can generate nonequilib- 
rium (surrogate) work [Til . [T^ . [33 | and, under various 
assumptions, a well-defined thermodynamic potential of 
mean force (PMF) can be derived from such data [svllisj. 
This contrasts the case where one starts with a high di- 
mensional stochastic process (of known functional form) 
and then uses stochastic analysis to reduce the dimen- 
sionality of the system by first appealing to asymptotic 
arguments [38] and then possibly modifying the resulting 
equations to achieve a desired physical constraint [41] . In 
both analytical and data-driven cases, the goal is often 
to construct a single limiting low-dimensional evolution 
equation that can be used to predict statistical proper- 
ties of the complex system valid over longer time-scales 
[TgI [37l . m, [3^. It is not quantitatively clear at what 
timescale such an approximation (if any such useful ap- 
proximation exists at all) is valid over. Furthermore it 
is usually difficult to determine if an equilibrium concept 
such as a PMF connects simply to trajectory-wise kinet- 
ics in small complex systems experiencing fluctuations. 
Again, an appealing feature of the approach advocated 
here is that various statistical hypothesis tests [HjHSjUI] 
can be used to quantitatively assess the validity of pro- 
posed (reduced) evolution equations to see if physically 
convenient models are consistent with the observed data. 
For example, such tests can be used to determine the 
time one needs to wait before inertia can be neglected 

Regarding the validity of using a single surrogate SDE 
to approximate "long term" > 0{ns) trends, a main un- 
derlying theme of this paper and others [12, [3, IH, [13] 
is that the presence of a lurking slowly evolving degree 
of freedom, F, can significantly complicate using a single 
equation and that methods for quantitatively accounting 
for this sort of variation are underdeveloped. Information 
in these types of models have proven useful in both theo- 
retical chemistry computations fl^. [33 an d in character- 
izing nanoscale experimental data [26l.l27l. |45| . Through- 
out this article, it is shown how the collection of surro- 
gate models can be linked with the ideas of "dynamic- 
disorder" [l| to make quantitative statements about sys- 
tems observed at the single-molecule level. 



III. A METHOD FOR COMPUTING THE AC 
FUNCTION OF COMPLEX SYSTEMS 

Before providing the algorithmic details of the method, 
the basic idea and motivation for the approach is sketched 
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in words. It is assumed that a T type coordinate slowly 
evolves (diffusively) over a relatively flat region of an ef- 
fective free energy surface. This evolution modulates the 
stochastic dynamics of the order parameter modeled, e.g. 
it changes the diffusion coefficient function i^lSj, J^] . How- 
ever, due to the almost continuous nature of the param- 
eter change, a sudden or sharp change in the process dy- 
namics is assumed difficult to detect in short segments of 
the time series (sudden regime changes or barrier cross- 
ings are not readily apparent in the data). Over longer 
time intervals, the changes become significant and the va- 
lidity of a simple SDE model like the ones considered here 
to describe the global dynamics become suspect. How- 
ever if the evolution rules are updated as time progresses 
in the spirit of a dynamic disorder description then 
there is hope for using a collection of these models to 
summarize the dynamics. Even if the data is truly sta- 
tionary, some fluctuations due to a F type coordinate may 
take a long time to be "forgotten" [2, 15]. The idea pro- 
posed here is to essentially to use the estimated model for 
a time commensurate with time interval length used for 
estimation/hypothesis testing and then suddenly switch 
model parameters. By doing this, one can take a col- 
lection of fairly simple stochastic models and construct 
another stochastic process possessing a more complex AC 
function. 

One advantage of such a procedure is that an ensem- 
ble of elementary or phcnomenological pieces can be con- 
structed to gain a better understanding of how variation 
induced by slowly evolving fluctuations affects some sys- 
tem statistics and this information may help in better 
quantitatively understanding some recently proposed en- 
zyme mechanisms This method is in line 
with the single-molecule philosophy that dynamical de- 
tails should not be obscured by bulk averaging artifacts 
when possible. It is demonstrated how using a tradi- 
tional AC summary of the data would obscure informa- 
tion of this sort on a toy example. Since the timescales at 
which simulations and single-molecule experiments span 
are rapidly converging, this new type of dynamical sum- 
mary can also be used to help in matching the kinet- 
ics of simulations and experiments and/or can be used 
to understand how more complex dynamics emerge from 
simpler evolution rules [1^ [lo, IMI, [s^l • 

Now for the algorithm details. Recall that for a single 
trajectory coming from a high dimensional system, the 
time series data is divided into partitions and within each 
partition the parameters of both candidate models are 
estimated by methods discussed in the previous section. 
This results in a collection |6>^|^ , for a each trajectory 
observed. The Euler-Maruyama [46| scheme is used here 
to simulate Nmc trajectories for each 9 G {de}f^i. The 
surrogate SDEs are recorded every St and denote simu- 
lated order parameter time series by {{z'^'^''^}JiTi,_i} i-i 
with j — 1 , . . . Nmc ■ To construct a new time series us- 
ing the Nmc trajectories generated, set = {},£= 1 
and for t=l to n x L repeat the following: 



1. Draw Uniform Integer u €E [1,Nmc]- 

3. Update counter £ = mod(i, L) + 1 

The procedure described results in a new time series 
{xf}fLi where N' = n x N. Note that the time ordering 
of the original data is maintained and the last step forces 
the series {xf to be periodic so time lags > N6t can- 
not be resolved with this method. If the integer n > 1, 
the series {a:f}^i contains more temporal samples than 
the original series. A larger sample size reduces the sta- 
tistical uncertainty in an empirically determined AC. The 
issue of reducing uncertainty is subtle and is discussed in 
detail using the toy model presented in the next section. 
If the time ordering is believed irrelevant, the first step 
can be modified to drawing 2 random integers. The other 
random integer can be used to randomize the £ index. 

[la. 

This procedure can then be repeated for each trajec- 
tory coming from a high dimensional system. It is to be 
stressed that suddenly and relatively infrequently regime 
switches ("barrier hopping") cannot be described with 
this method. If the simulation or experiment is associ- 
ated with a system possessing a jagged/rough free energy 
surface with many small barriers and if a single trajec- 
tory can frequently sample the hops, then there is hope 
for using this method. However note that the method is 
designed to treat relatively smooth regime changes (i.e. 
regime changes hard to identify by simple visual inspec- 
tion). A discussion on how the surrogate models can 
be potentially used in more complex situations is briefly 
discussed later. 



IV. RESULTS AND DISCUSSION 
A. Toy Model 

In order to demonstrate the AC method on a simple 
example and illustrate some points in a controlled setting 
we use the following SDE model: 
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where the constants a^, k^, rp are meant to play the role 
of the surrogate parameters [A, B, C) in the OU model. 
In the above expressions, superscripts are used simply to 
distinguish different constants or processes and do not 
represent exponentiation. Superscripts on the dBt terms 
are used to distinguish separate independent standard 
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Brownian motions. The Roman numeral superscripts dis- 
tinguish three cases: I) the standard OU model; II) an 
OU type model where the mean level, a evolves stochas- 
tically; and III) an OU type model where all parameters 
evolve stochastically. The parameter tq dictates the time 
scale at which the OU parameters stochastically evolve. 
The evolution studied here is made to be slow relative 
to that dictated by k". The (assumed unobserved) pro- 
cesses at , Kt j\t are meant to mimic a dynamic disorder 
type situation. 
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FIG. 1: The AC of 4 realizations from 4 toy processes dis- 
cussed in the text (left). The standard deviation of the AC 
curves measured in a trajectorywise fashion from a population 
of 100 trajectories. The computed standard deviation reflects 
the variance in a population of 100 empirically determined 
AGs. 

In addition, a fourth process referred to as "III 
(Proxy)" will be evolved to demonstrate the AC method 
of Section imi This process is constructed by simply set- 
ting the parameters k", 77" equal to the corresponding 
parameters of process III at time t and then evolving this 
process like a standard OU model until the time index 
hits t + T when the parameters are updated to those 
of process III at the same time. This procedure is then 
iterated. Randomizing T had little influence on the accu- 
racy here, but can be entertained. The processes above 
are simulated using the Euler-Maruyama scheme with 
time step size bs and the process is observed discretely 
every bt time unit. The remaining parameters are tuned 
to provide a parameter distribution consistent with those 
observed in some DHFR studies and are reported in the 
Appendix. 

The toy model is used to investigate how variation 
induced by slowly evolving F type factors influence the 
computed empirical AC on a controlled example where 
the assumptions behind the method introduced are sat- 
isfied. The features discussed are relevant to the DHFR 
system studied later and are also likely relevant to oher 
single-molecules studies. The example is also used to 
highlight issues relevant to nonergodic sampling 14], i.e. 
when temporal averages are not equivalent to ensemble 
averages. In this type of situation, single-molecule data 
is particularly helpful. Use of the same Brownian motion 
term to drive three separate processes facilitates study- 
ing contributions to variance in these types of studies. 
In addition, estimation is not carried out to keep the 
discussion simple and to remove an additional source of 
uncertainty. 



The left panel of Fig. [T] plots the empirical AC com- 
puted by sampling 4 realizations from this process us- 
ing 5000 observations uniformly spaced by bt — O.lSps. 
These time series lengths are commensurate with those 
used in typical MD applications ^, . One observes 

that the slowly evolving parameters do influence the AC 
measured. The fairly simple method of periodically up- 
dating the evolution parameters is able to mimic the AC 
associated with y^^^ for both short and long time scales. 
Furthermore, the variation induced by the relaxation and 
noise level (modulated by Kt and r\t) influences both the 
short time and longer time responses. The stochastic re- 
sponse of a dynamic disorder type process is clearly richer 
than a single exponential. An advantage the surrogate 
approach offers over popular existing methods for treat- 
ing this situation ,28,] is that other kinetic schemes, e.g. 
those associated with overdamped models with position 
dependent diffusion, can be entertained. In enzymes as- 
sociated with complex dynamics, other kinetics schemes 
may be needed to accurately reflect the stochastic dy- 
namics of the order parameter monitored. For example 
it is demonstrated in Fig. [3] that the PDOD surrogate 
is needed accurately captured relaxation kinetics even at 
short 0{ps) timescales. Over timescales relevant to ex- 
perimentally accessible order parameters characterizing 
conformational fluctuations, one may need to account for 
dynamical resp onses much richer than a mixture of ex- 
ponentials [13I ITsl. [isj. The procedure presented demon- 
strated how "elementary" pieces could be patched to- 
gether to characterize relaxations/fluctuations occurring 
over longer timescales. This is attractive to both com- 
puter simulations and experimental data sets. In what 
follows the attention is shifted to focusing on limitations 
of using a single AC to describe single-molecule time se- 
ries. 

The right panel of Fig. [T] plots the standard devi- 
ation of the AC function associated with a trajectory 
population. For each observed trajectory, 100 different 
SDE trajectories were used to compute 100 empirical ACs 
from the time series associated with the trajectories. The 
pointwise standard deviation measured over the 100 ACs 
is plotted. The curve shadings distinguish different time 
series sample sizes. The three cases studied consisted of 
(0.5, 2, 8) X lO'' discrete temporal observations; each time 
series was uniformly sampled with 0.15ps between obser- 
vations. Note that the influence of the evolving parame- 
ters on the measured AC is substantial. Recall all y pro- 
cesses used common Brownian paths (so computer gener- 
ated random numbers do not contribute to the differences 
observed). In addition, observe that the difference be- 
tween y^^ and y^^^ persists for a fairly long time and the 
length of time this difference is measurably noticeable 
depends on the temporal sample size used to compute 
the AC. In some applications, the variation induced by 
conformational fluctuations is important in computations 
[S^ l or to characterize a system [6, 47]. The standard de- 
viation in the measured AC here contains contributions 
coming from factors meant to mimic the influence unre- 
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solved conformational fluctuations whose influence per- 
sists for a fairly long time. In the AC computed with 
longer time series, i.e. spanning a larger time since the 
time between observations is fixed, the process has more 
time to "mix" and hence the difference between tempo- 
ral and ensemble averages is reduced. Said differently, 
the influence of the initial conformation, or "memory" , 
diminishes. By using a single long time series trajectory 
and only reporting one AC computed from this "mixed" 
series, these types of physically relevant fluctuations can 
get washed out by using a single AC function. This goes 
against the spirit of single-molecule experiments. 

The example considered here is admittedly simple and 
was constructed to illustrate the types of assumption be- 
hind the method introduced. If the dynamic disorder 
induced by large kinetic barriers or a complicated inter- 
action with the surroundings, then one would need to 
construct more sophisticated processes for determining 
how and when the parameters regime switch. Combin- 
ing the surrogate models with efforts along these lines, 
e.g. [4§|, may be able to help these more exotic situ- 
ations. Exploring the various routes by which complex 
and/or heavy tailed ACs 0, can emerge from simpler 
dynamical rules can help in a fundamental understanding 
of the governing physics [H, [s^, [Sll, [3l|. However, if the 
ensemble average decay rate is deemed the only quan- 
tity of physical relevance then the collection of surrogate 
models can still potentially be used to help in roughly 
predicting the rate of decay of more complex ACs. This is 
particularly relevant to simulations where obtaining long 
enough trajectories to reliably calibrate models possess- 
ing complex AC exhibiting long r ang e dependence from 
observed data is problematic [IJ, [la, 50}. Even in cases 
where one only requires the asymptotic time decay of 
an ensemble of conformations for a physical computation 
[sH and can simulate for a long enough time to directly 
monitor kinetics, an understanding of the distribution of 
surrogate models estimated will likely be of help in link- 
ing computer simulation force fields to single-molecule 
experimental time series. The remaining results use sim- 
ulations of DHFR to illustrate some of these points. 



B. Dihydrofolate Reductase(DHFR) 

1. DHFR Simulation Details 

The detailed computational details are reported in Ref. 
[S]. Briefly, an order parameter denoted by AZ^rmsd, was 
defined using the root mean square distance between 
two crystal structures 8]. This order parameter pro- 
vides an indication of the proximity to the "closed" and 
"occluded" enzyme state and is reported in units of A 
throughout. The initial path between the closed and oc- 
cluded conformations of DHFR was generated using the 
Nudged Elastic Band (NEB) method [H^. Subsequently, 
w50 configurations obtained from NEB path optimiza- 
tion were subjected to US simulations. During these US 




FIG. 2: Average local diffusion function estimated (left axis) 
and free energy (right axis) as a function of order parame- 
ter. The population average of the estimated C is plotted to 
give a feel for the position dependence of this quantity; it is 
stressed that the average alone is not adequate the describe 
the dynamics here due to a type of "dynamic disorder" phe- 
nomena [l|. The free energy was computed by K. Arora using 
methods described in Ref. B. 



simulations, production dynamics of 1.2ps at 300K was 
performed after equilibration using a weak harmonic re- 
straint. 



2. DHFR Results 

Figure [2] plots the average local diffusion coefficient of 
the surrogate SDE models using two different observation 
frequencies on the left axis and on the right axis the free 
energy computed in Ref. [8] is plotted. Each surrogate 
model was estimated using 400 time series observations 
with either 6 = 0.15 or S = 0.30 ps separating adjacent 
observations corresponding to L = 20 or L = 10 (respec- 
tively). The average local diffusion coefficient demon- 
strates a relatively smooth increasing trend for a ma- 
jority of the order parameter values explored, but then 
suddenly changes abruptly around ADj-msd ~ 3A. It has 
been observed that an interesting interplay between free 
energy, fluctuations and stiffness, exists in some enzyme 
systems S S Bill and this plot suggests that future 
works investigating some of the finer structural factors 
leading to these change may be worthwhile, though this 
direction is left to future work because it is outside the 
scope of this study. 

It is to be stressed that the mean of each US window 
is not adequate to summarize the dynamics. That is, a 
single fixed parameter surrogate SDE like the ones con- 
sidered here cannot mimic the longer time statistics of 
the process. This is why the AC procedure introduced 
in Section HITl is needed. Figure [3] demonstrates that the 
individual PDOD models do capture features simpler sur- 
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rogates cannot. This is due part to the position depen- 
dence of the local diffusion coefficient. The PDOD sur- 
rogate model combined with the procedure of Section Iflll 
can accurately summarize the long time dynamics. These 
points are explained further in the discussion associated 

with Figs, mi 
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FIG. 3: Local AC Function of MD data and that predicted by 
surrogates. The total time series of length 1.2 ns was divided 
into separate blocks, each containing 400 observations spaced 
by 0.15 ps. This data was used to estimate a collection of 
surrogate models and a collection of MD AGs corresponding 
to US target points of ADrmad « -1-8, 0.9, 1.9 and 3.lA. The 
AG corresponding to the estimate surrogate models is also 
plotted where the initial lag is normalized to unity to facilitate 
comparison (the raw units are shown in Fig. |3}. 



4D „=-1.8 
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FIG. 4: The global AG prediction. The procedure intro- 
duced here is used along with PDOD data to predict the long 
time AG. The relaxation predicted by the individual surro- 
gate (data shown in Fig. [2] is also shown to stress that the 
SDE parameters are not fixed, but evolving and any single 
surrogate cannot capture the richer long time dynamics. 



The ability of the PDOD model to capture features 
that a single exponential (e.g. the AC associated with an 
OU process) cannot is demonstrated in Fig. [3l Results 
from four different US points, each possessing a different 
degrees of position dependence on the noise are shown. 
Here the results obtained using both the OU and PDOD 
surrogates calibrated using the St — 0.15ps with 400 tem- 
poral observations and the corresponding AC predictions 
are shown in the plot. The empirical ACs computed using 
the short segments of MD data used for surrogate model 
parameter estimation are also reported. Results with 
400 blocks possessing observations spaced by St — O.SQps 
were similar in their AC prediction, but hypothesis tests 
strongly rejected the assumption of a fixed local diffu- 
sion (see Fig [5]). The 400 St = 0.1 5ps samples allowed 
the OU model structure to provide a better fit (as mea- 
sured the fraction rejected) because the local diffusion 
function had less time to evolve/change value. For cases 
where the position dependence is moderate, the PDOD 
and OU surrogate models predict qualitatively similar 
AC functions. However, the PDOD model captures the 
short time relaxation dynamics better than the OU for 
cases where the position dependence of the local diffusion 
is more substantial and hence for clarity we focus on the 
PDOD models in the remaining kinetic studies. 




^ -4 -3 -2 -1 1 2 3 4 

rmsd 

FIG. 5: Goodness-of-fit tests. The test of Ref. [25| ] was com- 
puted given the parameter estimate and the observed data. 
The median of each umbrella sampling window is reported. 

Figure [5] plots the empirically determined AC obtained 
from different MD production simulation data. The case 
labelled "in-sample" was the one used for estimation of 
the local models reported in Fig. [3] and that labeled 
"out-of-sample" was computed by running a longer 3.6 
ns simulation and computing the AC from the last 1.2 ns 
of this time series. The PDOD version of these models 
were used along with the procedure outlined in Section|TTT] 
using blocks of size 800 and randomizing the time index. 
The 400 blocks results were similar. Respecting the time 
ordering of the surrogate models only improved results 



8 



marginally. Note also that the general trends of the long 
time decay of the MD data is captured with the proce- 
dure and that there is substantial difference between the 
"in-sample" and "out-of-sample" MD trajectories [sst - 
The physical relevance of such variation was previously 
discussed and will be expanded on when results of sta- 
tionary DHFR density prediction are shown. The pri- 
mary observation is that a collection of PDOD surrogate 
models were able to capture the basic relaxation trends of 
the enzyme that a single surrogate could not. Recall that 
even at short timescales a single exponential decay was 
inadequate to fit the data. Similar trends were observed 
for all 51 US windows explored. However, it is to be 
stressed that the procedure shown here is to decompose 
kinetics in the longest contiguous block of discrete time 
series observed. If complex dynamics occur over longer 
timescales and data is not available that directly samples 
these scales, then the method cannot be used to predict 
the long time behavior that was not sampled. 




FIG. 6: Stationary density /histogram prediction. The bars 
denote 1.2 ns MD data, the solid thick line to the mixture 
PDOD method (see text), the solid dotted line to the average 
PDOD model, and the thin lines to the 4 local equilibrium 
densities used in constructing the mixture density. 

The goodness-of-fit of the surrogate models using the 
two candidate SDEs is shown in Fig. \5\ for various US 
windows. The median of the Q-test statistic introduced 
by [2^ is reported. This test statistic under the null is 
asymptotically normally distributed with mean zero and 
unit variance, but has also been proven to be useful in 
small samples [Ti, 1251, fsi, lii] . Recall that each MD time 
series (at each umbrella sampling window) was divided 
into small pieces. In the portions near the edges (larger 
|Armsd| values), where the position dependence of the 
noise is greatest, one observes that the OU model popu- 
lation has a median that would typically indicate a collec- 
tion of poor dynamical models. If conformational fluctu- 
ations slowly modulate the dynamics, the longer the time 
series one has, the likelihood of departing from any sim- 
ple surrogate model increases (5^. Goodness-of-fit tests, 



like the one presented here, can be used to quantitatively 
approximate when simple models begin departing from 
various assumptions. 

The stationary density predicted by the surrogate OU 
models in a case where position-dependence was shown 
to be marginal for the time interval data was monitored is 
plotted in Fig. [S) Here the mixture method discussed in 
Ref. [l3| is reported due to its relevance to a collection 
of surrogates and dynamic disorder. The histogram of 
the 1.2ns MD data is also plotted as well as the station- 
ary density predicted by a the average of the surrogate 
models taken at the US window near A_Di-,„sd ~ 2. Us- 
ing a single model obtained by aggregating all time series 
together in hopes of reducing surrogate parameter uncer- 
tainty due to the resulting smaller time series sample sizes 
actually worsens the results. The mixture of OU models 
calibrated using 4 sets of noncontiguously spaced 60ps 
data (i.e. 400 entries spaced 0.15ps) were sampled every 
300 ps from the MD process and this was used to compute 
4 surrogate OU model parameters. The goodness-of-fit 
tests indicated that the local surrogates given the data 
were reasonable dynamical models. So portions where 
the "local equilibrium" density, i.e. the stationary den- 
sity predicted by a surrogate with estimated parameters, 
possessing significant probability mass can be though of 
the regions of phase space sampled due to fast-scale mo- 
tion for a relatively fixed (and unobserved) value of F 
[26| . If variation in the conformational coordinate is 
important to thermodynamic averages, as the data here 
suggests to be the case in DHFR, then one needs to use 
a collection of "local equilibrium" densities [l3| • The ad- 
vantage of such an approach is that short bursts of sim- 
ulations started from different initial conditions can be 
run, then surrogate models can be calibrated and tested. 
If the surrogate is found suitable, it can then be used to 
make predictions on the local equilibrium density, and 
the variation in the local equilibrium densities can be 
used to partially quantify the degree to which a slow con- 
formational degree of freedom modulates the dynamics. 
This treatment is appealing when data on other physi- 
cally relevant order parameters is unknown are not easy 
to access. 



V. SUMMARY AND CONCLUSIONS 

Single-molecule experiments and simulations offer the 
potential for a detailed fundamental understanding of 
complex biomolecules without artifacts of bulk measure- 
ments obscuring the results. However, one must deal 
with complex multiscale fluctuations at this level of res- 
olution and the factors contributing to the noise often 
contain physically relevant information such as quantita- 
tive information about conformational degrees of freedom 
[2^ . The abundance of data available to researchers and 
recent advances in computational and statistical meth- 
ods are allowing researchers to entertain new methods of 
summarizing information relevant to modeling systems 
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at the nanoscale [H, [H, \^ . 

By applying surrogate models to the data coming from 
biased MD simulations of DHFR, it was demonstrated 
that a collection of stochastic dynamical models can be 
used to better understand the factors contributing to 
the shape of the autocorrelation function associated with 
fluctuations coming from multiple time scales. The sur- 
rogate models were estimated by appealing to maximum 
likelihood type methods [l8|, [l^, |20[ and were tested using 
goodness-of-fit tests which utilized the transition density 
of the assumed surrogate and were appropriate for the 
data. For example, the time series data was not assumed 
to be stationary; the stationarity assumption is often sus- 
pect in simulation data. The tests used [1^ indicated 
that taking the position dependence of the noise into 
account was required to provide a statistically accept- 
able model in many regions of phase space explored. For 
short timescales, the individual surrogate models (taking 
position dependence noise into account) were capable of 
predicting quantities outside of the fitting criterion, e.g. 
a parametric likelihood function was fit but the mod- 
els were able to predict short timescale autocorrelation 
functions and these physically based models were able 
to fairly accurately summarize/model relaxation kinetics 
that a simple exponential relaxation could not. Other en- 
zymes systems have exhibited this type of behavior [Ts'l 
and it is likely that future single-molecule experiments 
will yield data possessing this feature. 

Perhaps more importantly, we demonstrated that a 
population of surrogate models was required to represent 
the complex dynamical system because an unobserved 
conformational degree freedom modulated the dynamical 
response and this "random force" had to be accounted for 
in order to predict autocorrelations valid for longer tem- 
poral trajectories. A method using parametric surrogate 
models calibrated over short timescales while at the same 
time respecting the variability induced by unresolved co- 
ordinates evolving over longer timescale was presented. 
The DHFR system was another instance where aggre- 
gating a collection of simpler dynamical models gave rise 
to a more complex stochastic process |13|, i29|, ,30^ i31t i3^J ■ 



The basic idea is applicable to situations where a hidden 
slowly evolving degree of freedom modulates the dynam- 
ics and this coordinate evolves on an effective free energy 
surface possessing relatively low barriers ^13^. Issues as- 
sociated with extensions were briefly discussed. 

Even if a coarse system description, such as a single au- 
tocorrelation function, can be used to adequately approx- 
imate the physically relevant statistical properties of all 
experimentally accessible observables, the approach pre- 
sented still has appeal. One circumstance where this is 
particularly relevant is when computer simulation trajec- 
tories are compared to fre quen tly sampled experimental 
single-molecule time series [2g| . In experimental time se- 
ries, many conformational coordinates cannot typically 
be resolved [26j . so constructing a simulation that 
matches all relevant degrees of freedom is highly prob- 
lematic. Quantitative knowledge of how the variability 
induced by such hidden degrees of freedom is reflected in 
the surrogate model parameters distribution may help in 
refining force fields to match kinetic properties at multi- 
ple timescales. If the force fields are believed valid, then 
turning to the simulations for details of the structural 
dynamics can help us in understanding complex molec- 
ular machines [53|. This type of extra detail may also 
assist (or lead to new) methods for computing transi- 
tion rates [s^ . Furthermore, as nanotechnology demands 
higher resolution at smaller length and timescales, one 
may want to avoid using a single autocorrelation func- 
tion constructed by aggregating many meso or micro- 
scopic states each possessing different dynamical features 
because doing so may unnecessarily wash out physically 
relevant information. The phenomenologically motivated 
simple bottom-up strategy presented was one contribu- 
tion in this direction. 



VI. ACKNOWLEDGEMENTS 

The author thanks Karunesh Arora and Charles 
Brooks III for sharing the DHFR data. 



[1] H.P. Lu, L. Xun, and X.S. Xie. Single-molecule enzy- 
matic dynamics. Science, 282:1877-1881, 1998. 

[2] S.C. Kou and X.S. Xie. Generalized Langevin equation 
with fractional Gaussian noise: Subdiffusion within a sin- 
gle protein molecule. Phys. Rev. Lett, 93:180603, 2004. 

[3] M. Vendruscolo and C. M. Dobson. Dynamic visions of 
enzymatic reactions. Science, 313:1586, 2006. 

[4] K.A. Henzler-Wildman and et al. Intrinsic motions along 
an enzymatic reaction trajectory. Nature, 450:06410, 
2007. 

[5] O. Miyashita, J. N. Onuchic, and P. G. Wolynes. Nonlin- 
ear elasticity, proteinquakes, and the energy landscapes 
of functional transitions in proteins. Proc. Natl. Acad. 
Sci. U.S.A., 100:12570-12575, Oct 2003. 

[6] CP. Calderon and K. Arora. Extracting kinetic and sta- 



tionary distribution information from short md trajec- 
tories via a collection of surrogate diffusion models. J. 
Chem. Theory Comput., in press. 

[7] P. C. Whitford, J. N. Onuchic, and P. G. Wolynes. En- 
ergy landscape along an enzymatic reaction trajectory: 
hinges or cracks? HFSP J, 2:61-64, Apr 2008. 

[8] K. Arora and C. L. Brooks lii. Functionally Important 
Conformations of the Met20 Loop in Dihydrofolate Re- 
ductase are Populated by Rapid Thermal Fluctuations. 
J. Am. Chem. Soc, Mar 2009. 

[9] J. P. Junker, F. Ziegler, and M. Rief. Ligand-dependent 
equilibrium fluctuations of single calmodulin molecules. 
Science, 323:633-637, 2009. 
[10] S. Tripathi and J. J. Portman. Inherent flexibility de- 
termines the transition mechanisms of the EF-hands of 



10 



calmodulin. Proc. Natl. Acad. Sci. U.S.A., 106:2104- 
2109, Feb 2009. 

CP. Caldoron. On the use of local diffusion for path [30 
ensemble averaging in potential of mean force computa- 
tions. J. Chem. Phys., 126:084106, 2007. 

CP. Calderon and R. Chelli. Approximating nonequilib- [31 
rium processes using a collection of surrogate diffusion 
models. J. Chem. Phys., 128:145103, 2008. 

CP. Calderon and K. Arora. Extracting kinetic and sta- [32 
tionaxy distribution information from short md trajec- 
tories via a collection of surrogate diffusion models. J. 
Chem. Theory & Comput, 5:47, 2009. 

Ariel Lubelski, Igor M. Sokolov, and Joseph Klafter. Non- [33 
ergodicity mimics inhomogeneity in single particle track- 
ing. Physical Review Letters, 100(25) :250602, 2008. [34 
S. Kou. Stochastic modeling in nanoscale biophysics: 
subdiffusion within proteins. Annals of Applied Statistics, 
2:501-535, 2008. 

G. Hummer. Position-dependent diffusion coefficients [35 

and free energies from Bayosian analysis of equilibrium 

and replica molecular dynamics simulations. New J. [36 

Phys., 7:34, 2005. 

J. Chahine, R.J. Oliveira, V.B.P. Leite, and J. Wang. [37 
Configurational-dependent diffusion can shift the kinetic 
transition state and barrier height of protein folding. 
Proc. Natl. Acad. Sci. USA, 104:14646, 2007. [38 

Y. Ai't-Sahalia. Maximum-likelihood estimation of 
discretely-sampled diffusions: A closed-form approxima- 
tion approach. Econometrica, 70:223-262, 2002. 
J.C Jimenez and T. Ozaki. An approximate innovation [39 
method for the estimation of diffusion processes from dis- 
crete data. J. Time Series Analysis, 27:77-97, 2006. 
S. Chen and Tang CY. Parameter estimation and 
bias correction for diffusion processes. J. Econometrics, [40 
149:65-81, 2009. 

Y. Pokern, A.M. Stuart, and E. Vanden-Eijnden. Re- [41 
marks on drift estimation for diffusion processes. Multi- 
scale Model. SimuL, in press, 2009. 

CP. Calderon. Fitting effective diffusion models to data 
associated with a "glassy potential" : Estimation, classi- [42 
cal inference procedures and some heuristics. Mutliscale 
Modeling and Simulation, 6:656-687, 2007. 
Yacine Ait-Sahalia, Jianqing Fan, and Heng Peng. Non- 
parametric transition-based tests for jump-diffusions. [43 
JASA, in press, 2009. 

S.X. Chen and CY. Tang. A tost for model specification 

of diffusion processes. Annals of Statistics, 36:167-198, [44 

2008. 

Y. Hong and H. Li. Nonparametric specification test- 
ing for continuous-time models with applications to term [45 
structure of interest rates. The Review of Financial Stud- 
ies, 18:37-84, 2005. 

CP. Calderon, N. Harris, C.-H. Kiang, and D.D. Cox. [46 
Quantifying multiscale noise sources in single-molecule 
time series via pathwise statistical inference procedures. [47 
J. Phys. Chem. B, 113:138, 2009. 

CP. Calderon, W.-H. Chen, N. Harris, K.J. Lin, and 
C.-H. Kiang. Analyzing DNA melting transitions using 
single-molecule force spcctrscopy and diffusion models. [48 
J. Physics: Condensed Matter, 21:034114, 2009. 
N. D. Socci, J. N. Onuchic, and P. G. Wolynes. Diffusive 
dynamics of the reaction coordinate for protein folding [49 
funnels. J. Chem. Phys., 104:5860-5868, 1996. 
C. W. J. Granger. Long memory relationships and the 



aggregation of dynamic models. Journal of Econometrics, 
14(2):227 - 238,' 1980. 

C. W.J. Granger and R. Joyeux. An introduction to long- 
memory time series models and fractional differencing. J. 
Time Series. Analysis, 1:15-30, 1980. 

D. R. Cox. Long-range dependence, non-linearity and 
time irreversibility. J. Time Series. Analysis, 12:329-335, 
1991. 

Iddo Eliazar and Joseph Klafter. From ornstein- 
uhlenbeck dynamics to long-memory processes and frac- 
tional brownian motion. Physical Review E (Statistical, 
Nonlinear, and Soft Matter Physics), 79(2):021115, 2009. 
B.L.S. Prakasa Rao. Statistical Inference for Diffusion 
Type Processes. Hodder Arnold, 1999. 
CP. Calderon, L. Janosi, and I. Kosztin. Using stochastic 
models calibrated from nanosecond nonequilibrium simu- 
lations to approximate mesoscale information. J. Chem. 
Phys., 130:144908, 2009. 

R. Zwanzig. Nonequilibrium Statistical Mechanics. Ox- 
ford University Press, New York, 2001. 
CW. Gardiner. Handbook of Stochastic Models. Springer- 
Verlag, Berlin, 1985. 

S. Park and K. Schulten. Calculating potentials of mean 
force from steered molecular dynamics simulations. J. 
Chem. Phys., 120:5946-5961, 2004. 

R. Kupferman, , G.A. Pavliotis, and A.M. Stuart. Ito 

versus stratonovich white noise limits for systems with 
inertia and colored multiplicative noise. Phys. Rev. E, 
70:036120, 2004. 

D.I. Kopelevich, A.Z. Panagiotopoulos, and I.G. 
Kevrekidis. Coarse-grained kinetic computations for rare 
events: Application to micelle formation. J. Chem. 
Phys., 122:044908, 2005. 

H. Risken. The Fokker-Planck Equation. Springer- Ver lag, 
1996. 

Peter Arnold. Langevin equations with multiplicative 
noise: Resolution of time discretization ambiguities for 
equilibrium systems. Phys. Rev. E, 61(6):6091-6098, Jun 
2000. 

Cristian Micheletti, Giovariiu Bussi, and Alessandro 
Laio. Optimal langevin modeling of out-of-equilibrium 
molecular dynamics simulations. The Journal of Chemi- 
cal Physics, 129(7):074105, 2008. 

G. Hummer and A. Szabo. Free energy recoiistniction 
from nonequilibrium single-molccule pulling experiments. 
Proc. Natl. Acad. Sci. USA, 98:3658-3661, 2001. 
CP. Calderon. Local diffusion models for stochastic re- 
acting systems: estimation issues in equation-free numer- 
ics. Mol. Sim., 33:713-731, 2007. 

CP. Calderon, N. Harris, C.-H. Kiang, and D.D. Cox. 

Analyzing single-molecule manipulation experiments. J. 
Mol. Recognit., in press, 2009. 

P. Kloeden and E. Platen. Numerical Solution of Stochas- 
tic Differential Equations. Springer- Verlag, 1992. 
K. Arora and C. L. Brooks, HI. Large-scale allosteric 
conformational transitions of adenylate kinase appear to 
involve a population-shift mechanism. Proc. Natl. Acad. 
Sci. USA, 104:18496-18501, 2007. 

S. V. Krivov and M. Karplus. Diffusive reaction dynamics 
on invariant free energy profiles. Proc. Natl. Acad. Sci. 
U.S.A., 105:13841-13846, Sep 2008. 

I. Horenko and Schutte C. Likelihood-based estimation 
of multidimensional langevin models and its application 
to biomolecular dynamics. Mult. Mod. Sim., in press. 



11 



[50] Nonlinear stochastic models of 1/f noise and power- law 
distributions. Phystca A: Statistical Mechanics and its 
Applications, 365(1):217 - 221, 2006. 

[51] P. Maragakis, K. LindorfT-Larsen, M.P. Eastwood, R.O. 
Dror, J.L. Klepeis, I.T. Arkin, M.O. Jensen, H. Xu, 
N. Trbovic, R.A. Friesner, A.G. Palmer, and D.E. Shaw. 
Microsecond molecular dynamics simulation shows effect 
of slow loop dynamics on backbone amide order parame- 
ters of proteins. J. Phys. Chem. B, 112:6155-6158, 2008. 

[52] Jhih-Wei Chu, Bernardt L. Trout, and Bernard R. 
Brooks. A super-linear minimzation scheme for 
the nudged elastic band method. J. Chem. Phys., 
119(24):12708-12717, 2003. 

[53] M. Sotomayor and K. Schulten. Single-molecule exper- 
iments in vitro and in silico. Science, 316:1144 - 1148, 
2007. 

[54] P. G. Bolhuis, C. Dellago, and D. Chandler. Reaction 
coordinates of biomolecular isomerization. Proc. Natl. 
Acad. Set. U.S.A., 97:5877-5882, May 2000. 

[55] Y.A. Kutoyants. Statistical Inference for Ergodic Diffu- 
sion Processes. Springer, New York, 2003. 

[56] We assume here that the initial condition is not random 
and hence does not contribute to the log likelihood func- 
tion. 

[57] Using a randomized i index turns out to be adequate for 
DHFR (the accuracy gain of respecting the time order- 
ing provided only marginal improvements), though we 
present the version respecting time ordering in the algo- 
rithm to simplify the exposition. 

[58] The large differences persist even if the time series length 
is increased by a factor of 3. 

[59] As always, when the amount of evidence increases, the 
likelihood of rejecting any over simplified model in- 
creases. However the tests developed in Ref. [2^ are as- 
sociated with "diagnostics" which can help one in de- 
termining if a rejected model might still contain useful 
information nonetheless. 



VII. APPENDIX 

A. Toy Model Parameters 

St = 0.15, (5s = <5t/50,r = 200 x 5t,To = 120, (0°, k°, = 
(4,0.2,0.5),(cr",(T«,(T'') = (6.5x10-2,6.5x10-3,1.9x10-2). The 
last set of parameters were selected to give the evolving OU pa- 
rameters a stationary distribution characterized by three indepen- 
dent normals each having mean (oP , ,rp) and standard deviation 
(1/2,1/20, 3/20). The initial condition of each y process was set to 
a" and the OU parameters were all set to {oP , K^,r]'^). 100 batches 
of 4 independent Brownian motion processes were used to evolve 
the system. 



B. Predicting Quantities with Surrogate Models 

The OU process is attractive for a variety of reasons. Its con- 
ditional and stationary density are both known analytically and it 
can be readily estimated from discrete data. For parameters pos- 
sessing a stationary distribution, these can all be written explicitly 
in terms of Normal densities. Another appealing feature is that the 
AC function, denote this function by AC(t), [2^ associated with 



a stationary process can readily be computed after parameter es- 
timates are in hand, namely AC{t) = exp —Bt; recall the drift of 
the OU process is given by B{A — z). 

Unfortunately, these types of statistical summaries are more dif- 
ficult to obtain with other SDEs. Position dependence of the diffu- 
sion function and nonlinear models severely complicate obtaining 
analytic expressions for the autocorrelation function. Note that 
once a single SDE models is estimated, a new large collection of 
sample paths can be simulated and quantities like the autocor- 
relation function associated with a given SDE model and can 
be empirically determined (the computational cost of simulating a 
scalar SDE is typically marginal in relation to a MD simulation). 
This can be repeated for each surrogate SDE estimated from each 
MD path. 

However, a stationary density, under mild regularity conditions, 
of a scalar SDE can often be expressed in closed-form using only 
information contained in the estimated SDE coefficient functions 
via the relation [40l . |55| 

p-'^'^(z;r) = Z/a{zfexp(^r fi{z')/a{z'fdz'y (5) 

where in the above the SDE functions' dependence on 8 and T has 
been suppressed to streamline the notation. Z represents a con- 
stant to ensure that the density integrates to unity and zref repre- 
sents an arbitrary fixed reference point. When evaluating p'^^{-), 
one can encounter technical difficulties if the diffusion coefficient 
is allowed to take a zero or negative value (this is relevant to the 
PSOD model). Some heuristic computational approaches to deal- 
ing with this are discussed in Refs. [T3I. l44j| . 

Sometimes a thermodynamic motivation exists for expressing 
the stationary density of the high-dimensional molecular system in 
terms of some potential, denoted here by y(z), that does not explic- 
itly depend on the diffusion function [391. |40(I . In time- homogeneous 
scalar overdamped Brownian dynamics, where the forces of interest 
acting on z are believed related to the gradient of V{z), a "noise- 
induced drift" term [if^ can be added to the drift function and this 
addition cancels out the contribution coming from the l/o-{z)^ term 
outside the exponential. The stationary density of the modified 
SDE can then be expressed as being proportional to exp ( — V{z)) 
in such a situation. This type of modification has a thermodynamic 
appeal when z is the only important variable of the system and the 
fast-scale noise has been appropriately dealt with 1381 . The utility 
of such an approach in describing the pathwise kinetics of trajecto- 
ries is another issue and single-molecule studies are one area where 
the distinction may be important (one may not care as much about 
the stationary ensemble distribution) . 

However when there are slowly evolving lurking variables like 
r modulating the dynamics (as is the case in many biomolecular 
systems) using simple expression like Eq. [5] to approximate the 
stationary density of the high-dimensional system (with or without 
"noise-induced drift" corrections) is highly problematic. Note that 
the F variable has been retained in the left hand side of Eq. \5\ the 
stationary density estimate is only meant to be valid for a fixed 
estimated SDE surrogate corresponding to one value 9. In this 
paper and others, it is assumed that for a short time interval both 
8 and F are effectively frozen. Given a model and short time data, 
this can be tested using goodness-of-fit tests. However over longer 
timescales, F evolves and modulates the dynamics so the estimated 
8 evolves in time (this is why the situation can be though of as a 
type of dynamic disorder [ll). For this long time evolution, it is 
assumed that the form of a stochastic process depending only on 
z is completely unknown to the researcher. Furthermore it was 
assumed that another order parameter (i.e. system observable) is 
unavailable or is unknown 13, 26, 48]. Hence to approximate the 
stationary distribution of the high-dimensional molecular system 
one would require a collection of p'^^'s (each with different F's) 
to approximate this quantity. This procedure is presented in Ref. 



